Data Variable:
## [1] "X" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality" "rating"
Data Structure:
## 'data.frame': 1599 obs. of 14 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ rating : chr "normal" "normal" "normal" "normal" ...
Data Summary:
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality rating
## Min. : 8.40 Min. :3.000 Length:1599
## 1st Qu.: 9.50 1st Qu.:5.000 Class :character
## Median :10.20 Median :6.000 Mode :character
## Mean :10.42 Mean :5.636
## 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :14.90 Max. :8.000
Quality of red wine:
## [1] 5 6 7 4 8 3
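The outputs above are what the usual base R inspection functions print. Here is a minimal sketch of those calls. The data frame name `wine` is an assumption (the original loading code is not shown), and the four-row toy table only copies the first values from the str() output, so it is not the real file.

```r
# Toy stand-in for the wine data: values copied from the first rows of the
# str() output above. The name `wine` is hypothetical.
wine <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28),
  quality          = c(5L, 5L, 5L, 6L)
)

names(wine)           # variable names
str(wine)             # type and first values of each column
summary(wine)         # min, quartiles, mean and max per column
unique(wine$quality)  # distinct quality scores, in order of first appearance
```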
Start with the distribution of each individual variable:
fixed.acidity:
There are some outliers above 15. The distribution is highly concentrated around 8.
volatile.acidity:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There are some outliers above 1.3, and two peaks at 0.4 and 0.6
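The stat_bin message means that ggplot2 chose the bin width itself (the data range divided by 30). Here is a sketch of how the width could be set explicitly, assuming the histograms were drawn with `geom_histogram`. The value 0.05 is an arbitrary illustrative choice, and the ten values are only the first rows shown in the str() output.

```r
# Toy values: the first ten volatile.acidity values from the str() output.
volatile.acidity <- c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50)

# The default the message refers to: the data range split into 30 bins.
default_binwidth <- diff(range(volatile.acidity)) / 30

# An explicit width (0.05 is an assumption, not the author's setting).
chosen_binwidth <- 0.05

# Draw the histogram only when ggplot2 is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(volatile.acidity),
                       ggplot2::aes(x = volatile.acidity)) +
    ggplot2::geom_histogram(binwidth = chosen_binwidth)
  print(p)
}
```

Passing `binwidth` also silences the message, because ggplot2 no longer has to fall back to its default.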
citric.acid:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There are some outliers near the maximum value of 1, and many of the red wines have a value of 0.
Check the share of zeros:
##
## FALSE TRUE
## 1467 132
## [1] 0.08255159
About 8% of the red wines (132 of 1599) have a citric.acid value of 0.
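The two outputs above look like a table of a logical test and its mean. Here is a small sketch that rebuilds the same numbers from the reported counts alone. On the real data the test would be something like `wine$citric.acid == 0`, where `wine` is an assumed name for the data frame.

```r
# Rebuild the indicator from the counts reported above:
# 132 wines with citric.acid equal to 0 and 1467 with a nonzero value.
is_zero <- c(rep(TRUE, 132), rep(FALSE, 1467))

table(is_zero)               # counts of FALSE and TRUE
zero_share <- mean(is_zero)  # the mean of a logical vector is the share of TRUE
zero_share
```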
residual.sugar:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There are some outliers above 10, and a high concentration around 2.3.
chlorides:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There are some outliers above 0.5, and a high concentration around 0.08.
free.sulfur.dioxide:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There are some outliers above 60, and most of the values fall between 5 and 20.
total.sulfur.dioxide:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There are some outliers above 175, and most of the values fall between 25 and 75.
density:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## Warning: position_stack requires constant width: output may be incorrect
There is a central peak at 0.997, and the distribution looks approximately normal.
pH:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There is a central peak at 3.3, and this distribution also looks approximately normal.
sulphates:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There are some outliers above 1.4
alcohol:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
There is a peak around 9.5 followed by a rapid decrease. No red wine has an alcohol value below 8 (the minimum is 8.4).
quality:
Most wines have a quality of 5 or 6.
There are 1599 observations and 12 features, not counting the row index X and the derived rating variable. One feature (quality) is categorical, stored as an integer score; the other 11 are numerical.
I think the main feature is quality. People care about quality more than about the other features.
The other features may all be helpful, because they can influence taste. For example, residual.sugar indicates how sweet the wine is, and the acid features relate to its acidic flavour. SO2 is also regarded as an important ingredient in red wine that influences taste.
I created a rating variable based on quality: wine with quality below 5 is rated bad, wine with quality above 7 is rated good, and the rest is rated normal.
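The rating rule can be sketched as follows. This follows the wording literally (below 5, above 7), so the treatment of the boundary scores 5 and 7 is an assumption; the original code may have used different cut points.

```r
# Toy quality scores covering the observed range 3 to 8.
quality <- c(3L, 4L, 5L, 6L, 7L, 8L)

# Literal reading of the rule: below 5 is bad, above 7 is good, the rest normal.
# Counting 5 and 7 as "normal" is an assumption about the boundaries.
rating <- ifelse(quality < 5, "bad",
                 ifelse(quality > 7, "good", "normal"))

data.frame(quality, rating)
```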
citric.acid has many observations with a value of 0, which is really unexpected.
To get a quick overview of each pair of variables, use ggpairs:
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
At first glance, quality seems to correlate most strongly with alcohol, volatile.acidity, sulphates, and citric.acid.
Alcohol has a strong negative correlation with density: wines with more alcohol have lower density.
Sulphates and chlorides are strongly correlated, as are sulphates and citric.acid.
pH is strongly correlated with fixed.acidity, citric.acid, and volatile.acidity, which is not surprising.
total.sulfur.dioxide and free.sulfur.dioxide are strongly correlated.
citric.acid, volatile.acidity, and fixed.acidity are all strongly correlated with each other.
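The pairwise correlations read off the plot matrix can also be computed directly. Here is a minimal sketch with `cor`, using only the first ten rows shown in the str() output. The name `wine_head` is hypothetical, and ten rows are far too few to reproduce the correlations of the full data, so the numbers it prints are only illustrative.

```r
# First ten rows of four columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH                   = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 3.30, 3.39, 3.36, 3.35),
  free.sulfur.dioxide  = c(11, 25, 15, 17, 11, 13, 15, 15, 9, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40, 59, 21, 18, 102)
)

# Pearson correlation for every pair of columns.
cor_matrix <- cor(wine_head)
round(cor_matrix, 2)

# A single pair, with a significance test.
cor.test(wine_head$fixed.acidity, wine_head$pH)
```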
Plot quality against alcohol, volatile.acidity, sulphates, and citric.acid:
Wines with higher alcohol tend to have better quality.
Wines with lower volatile.acidity tend to have better quality.
High-quality wines seem to have slightly higher sulphates.
High-quality wines have higher citric.acid.
Now let’s check sugar against quality:
Surprisingly, wine quality has no strong correlation with residual.sugar.
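A numeric companion to such a plot is the median of a feature within each quality score. Here is a sketch with `tapply`, again using only the first ten rows from the str() output, so the result illustrates the technique and does not confirm the finding.

```r
# First ten rows of residual.sugar and quality from the str() output.
residual.sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2.0, 6.1)
quality        <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Median residual.sugar within each quality score.
sugar_medians <- tapply(residual.sugar, quality, median)
sugar_medians

# Overall linear association between the two variables.
sugar_cor <- cor(residual.sugar, quality)
sugar_cor
```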
Since fixed.acidity, volatile.acidity, and citric.acid are strongly correlated with both pH and quality, it is strange that pH and quality are not strongly correlated. Plot to check it:
High-quality wines seem to have a slightly lower pH; however, there are many outliers.
Alcohol and density are negatively correlated.
Wines with higher alcohol tend to have better quality.
Wines with lower volatile.acidity tend to have better quality.
High-quality wines seem to have slightly higher sulphates.
High-quality wines have higher citric.acid.
From the above, higher-quality wines seem to contain more acid, so their pH should be lower; however, there are many outliers.
Alcohol has a strong negative correlation with density: wines with more alcohol have lower density.
Sulphates and chlorides are strongly correlated, as are sulphates and citric.acid.
pH is strongly correlated with fixed.acidity, citric.acid, and volatile.acidity, which is not surprising.
total.sulfur.dioxide and free.sulfur.dioxide are strongly correlated.
citric.acid, volatile.acidity, and fixed.acidity are all strongly correlated with each other.
From the correlation plot, density has its strongest negative correlation with fixed.acidity.
First, I plot alcohol and density by rating:
Bad-rated wines sit in the upper left, and good-rated wines sit in the lower right.
Then we check the acid features against quality:
I can’t get any insight from this plot.
Wines with higher citric.acid and lower pH seem to be of better quality, but the difference is small.
Wines with lower volatile.acidity and lower pH seem to be of better quality.
Now check with sulphates:
Higher sulphates go with better quality, and density seems to make no difference.
Higher alcohol combined with higher sulphates goes with better quality.
Higher sulphates go with better quality, and pH seems to make no difference.
Lower volatile.acidity combined with higher sulphates goes with better quality.
In summary, high alcohol with high sulphates tends to give better-quality wine, as does lower volatile.acidity with higher sulphates.
Since the acid features affect wine quality, I expected that pH combined with the acid features would reveal something. However, pH combined with the acid features shows no clear effect on quality.
People may think residual sugar affects the quality of wine, but this is a misconception. This plot shows that residual sugar has no notable relationship with quality.
Bad-rated wines sit in the upper left, which means low alcohol and high density. Good-rated wines sit in the lower right, which means high alcohol and low density.
Good-quality wines sit in the upper left, which means lower volatile.acidity and higher sulphates. Low-quality wines sit in the lower right, which means higher volatile.acidity and lower sulphates.
At first I showed a summary of the data to get a basic understanding of it: for example, how many variables the data has and what each variable’s min, mean, and max are. Then, to learn more about the individual variables, I explored each one and plotted its histogram. These plots show how each variable is distributed, whether there are many outliers, and whether the collected data makes sense. For example, it is weird that 8% of the data has a citric.acid value of 0.
In the “bivariate” section, I tried to show each variable’s influence on ‘quality’. It surprised me that ‘residual.sugar’ has no notable influence on wine quality. And although the acid-related features are highly correlated with wine quality, ‘pH’ does not show this relation, maybe because too many factors can affect the pH value.
In the ‘multivariate’ section, I tried to investigate whether combinations of variables affect wine quality. For example, most of the better-quality wines have low volatile.acidity and high sulphates. But here I have a question. The ‘bivariate analysis’ already shows that lower volatile.acidity goes with better quality, and that higher sulphates go with better quality. Can we say that volatile.acidity and sulphates strengthen each other’s effect on quality? I think answering this question well requires more knowledge of statistical analysis.
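One common way to approach this question is to compare a model with separate effects against a model that adds an interaction term. Below is a sketch under loud assumptions: the data is simulated (not the wine data), the coefficients used to generate it are arbitrary, and quality is treated as a plain number for simplicity. It only shows the mechanics of the comparison, not an answer for the real data.

```r
# Simulated stand-in data; the ranges loosely follow the summary above,
# the coefficients are arbitrary and there is no true interaction in it.
set.seed(1)
n <- 200
toy <- data.frame(
  volatile.acidity = runif(n, 0.12, 1.58),
  sulphates        = runif(n, 0.33, 2.00)
)
toy$quality <- 6 - 2 * toy$volatile.acidity + 1 * toy$sulphates +
  rnorm(n, sd = 0.5)

# Model 1: the two features act separately (additive effects).
additive <- lm(quality ~ volatile.acidity + sulphates, data = toy)

# Model 2: adds the product term volatile.acidity:sulphates.
with_interaction <- lm(quality ~ volatile.acidity * sulphates, data = toy)

# F-test: does the interaction term explain noticeably more variance?
anova(additive, with_interaction)
```

If the interaction term is not significant on the real data, the two features would be better described as acting separately than as strengthening each other.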
This wine data contains 12 variables: 11 physicochemical variables and one variable, ‘quality’, which is the one we care about. There are 1599 observations. Predicting quality can therefore be viewed as a regression task. To enrich the analysis, linear regression or another machine learning method could take the features as input and predict the output (quality).
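The regression idea could start from something like the sketch below. It rests on assumptions: the data is simulated within the ranges from the summary above, the generating coefficients are arbitrary, and the choice of two predictors and an 80/20 split is only illustrative.

```r
# Simulated stand-in data (not the wine data); coefficients are arbitrary.
set.seed(42)
n <- 300
toy <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
toy$quality <- 2 + 0.3 * toy$alcohol - 1 * toy$volatile.acidity +
  rnorm(n, sd = 0.6)

# Hold out 20% of the rows to check predictions on unseen data.
train_rows <- sample(n, size = round(0.8 * n))
train   <- toy[train_rows, ]
holdout <- toy[-train_rows, ]

# Fit a linear regression on the training rows.
fit <- lm(quality ~ alcohol + volatile.acidity, data = train)

# Root mean squared error of the predictions on the held-out rows.
pred <- predict(fit, newdata = holdout)
rmse <- sqrt(mean((holdout$quality - pred)^2))
rmse
```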